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ABSTRACT 

We present a scalable parallel solver for numerical constraint 
satisfaction problems (NCSPs). Onr parallelization scheme 
consists of homogeneous worker solvers, each of which runs 
on an available core and communicates with others via the 
global load balancing (GLB) method. The parallel solver is 
implemented with XIO that provides an implementation of 
GLB as a library. In experiments, several NGSPs from the 
literature were solved and attained up to 516-fold speedup 
using 600 cores of the TSUBAME2.5 supercomputer. 

Categories and Subject Descriptors 

D.1.3 [Concurrent Programming]: Parallel Programming 


General Terms 

Algorithms, Artificial Intelligence 
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1. INTRODUCTION 

Numerical constraint satisfaction problems (NCSPs, Sec¬ 
tion and their dedicated solvers have been successfnlly 
applied to problems in the domain of real numbers [22[ 

f . Given a NCSP with a search space represented by a box 
.e., interval vectors), the branch and prune algorithm bi¬ 
sects the box or filters an inconsistent portion of the box 
repeatedly, until finally obtaining a paving (i.e., a set of 
boxes that precisely enclose the solutions set). However, 
the exponential compntational complexity of NCSPs limits 
the number of tractable instances; therefore, parallelization 
of NCSP solvers that can scale on a number of cores is a 
promising approach for the further development of nnmeri- 
cal constraint programming [^. 

Search-space splitting is a simple and efficient approach for 
parallelization of CSP solvers, yet state-of-the-art implemen¬ 
tations are still limited to scaling up to a few hundred cores 
(Section]^. Recently, Saraswat et al. have proposed a 
global load balancing framework: a scalable scheme for the 
global workload distribution and termination detection of ir¬ 
regular parallel computation, which typically applies to the 
CSP solving process (Section |4.1| . That framework is imple¬ 
mented with XIO and is available in the official distribution 
of XIO as the GLB library . 


In this work, we propose a parallel NCSP solver that uses 
the GLB library (Section |4.2[ ). The solver is simply im¬ 
plemented with XIO by adopting the (sequential) constraint 
propagator of the Realpaver solver as a unit process of GLB. 
Section reports the experimental results obtained when 
our method was deployed on 600 cores of the TSUBAME2.5 
supercomputer. Optimal configurations of the GLB param¬ 
eters are analyzed on the basis of those experimental results. 


2. RELATED WORK 

A survey by Gent et al. describes existing parallel CSP 
solvers by classifying them into three categories: search- 
space splitting methods, cooperative methods for heteroge¬ 
neous workers (e.g. portfolios and parallel local search) 
and parallelization of the constraint propagation process 
Our work focuses on the first approach. 

The main difficulty of the search-space splitting approach 
lies in the balanced distribution of the sub-trees, which keeps 
worker solvers active. When the search tree becomes highly 
unbalanced, it becomes difficult to predict the appropriate 
splitting of the search tree, so a dynamic load balancing 
scheme becomes necessary. A work stealing scheme is typ¬ 
ically used for this purpose: when a worker is starving, it 
sends a request to other workers, and workloads are commu¬ 
nicated in response. Most existing works experiment with a 
limited number (e.g., 40 [^, 64 14 or 256 [^[^) of pro¬ 
cessors, and the load balancing tends to assume a central 
master process, which may limit scalability [23[ [^. 

A substantial amount of work exists regarding the par¬ 
allelization of the branch and bound algorithm with both 
search-space splitting and work stealing [H [W 18 . Al¬ 
though the branch and bound algorithm resembles the solv¬ 
ing process of NCSPs, existing works consider the discrete 
domain; our work explores an efficient parallel method that 
handles the continuous domain with interval computation. 

There exist parallel CSP/SAT solvers implemented with 
XIO mimziiizi- None has yet utilized the GLB library in 
its implementation. 


3. NUMERICAL CONSTRAINT 
PROGRAMMING 

Numerical constraint programming is an extension of dis¬ 
crete constraint programming and uses techniques that 
are inherited from interval analysis jl6| . Their variable do¬ 
mains are continuous subsets of R: (machine-representable) 
intervals [a, 6] := {r € R j a < r < b}, where a and b are 
floating-point numbers, and boxes (or vectors of intervals) 







([ai, foi],..., [a„, b„]). In the following, boldface characters 
(e.g., v) denote intervals and boxes. I denotes the set of 
intervals and I" denotes the set of n dimensional boxes. Nu¬ 
merical constraint solving resorts to validated interval com¬ 
putation and the branch and prune scheme: the solvers eval¬ 
uate an interval extension of a real function in a reliable 
manner, e.g., an interval extension of addition is computed 
as [fli, &i]-I-[ 02 , 62 ] := [[01 + ^ 2 ], \bi+b 2 ]] where [-J and [■] 
denote the downward and upward rounding mode control, 
and the branch and prune algorithm enumerates interval as¬ 
signments based on the dichotomy principle. 

A numerical constraint satisfaction problem (NCSP) is de¬ 
fined as a triple (v,vo,c) that consists of a vector of vari¬ 
ables V = (vi, ... ,Vn), an initial domain in the form of a 
box uo G I", and a constraint c(v) := f{v) = 0 A g{v) < 0, 
where / : R" —and g- : R"' —>• R’*®', i.e., a conjunction 
of n/ equations and Ug inequalities. A solution of a NCSP is 
an assignment of its variables h € uo that satishes the con¬ 
straint c{v). The solution set E is the region {h G uq | c(i;)}. 
A NCSP is well-constrained when n = Uf, under-constrained 
when n > Uf, and over-constrained when n < Uf. In gen¬ 
eral, a well-constrained NCSP has a discrete solution set and 
an under-constrained NCSP has a continuous solution set. 

Example 1. We can model the intersection of two disks 
in the (vi,V 2 ) plane as an under-constrained NCSP, where 
V := (vi,...,V4), Vo = ([-1,1], [-1,1], [0,1], [0,1]), and 

C = (Vi -t-vi - V 3 , (vi - 1)^ -I-V2 - V4) = 0. 

The solution set projected onto the (ui, V2) plane is depicted 
in Figure 

3.1 Branch and Prune Algorithm 

The branch and prune algorithm is the standard solv¬ 
ing method for NCSPs. It takes a NCSP and a precision 
e as input and outputs a set of boxes (or paving) S that 
approximates the solution set with precision e. 

Figure presents the algorithm. In the main loop at 
Lines 2-11, the algorithm first takes the first element of 
the queue L of boxes and applies the Prune procedure that 
shaves boundary portions of the considered box (Line 3). In 
this work, we use a basic implementation HC4Revise for 
well-constrained problems; for under-constrained problems, 
we use an implementation proposed in that provides a 
verification process based on an interval Newton method 
combined with HC4Revise. As a result of Prune, a box be¬ 
comes either empty, precise enough (its width is smaller than 
e), verified as an inner box of the solution set E, or unde¬ 
cided. Precise and inner boxes are appended to S (Line 6 ) 
and undecided boxes are passed to Branch. Next, the Branch 
procedure bisects the box at the midpoint along a compo¬ 
nent corresponding to a variable and the sub-boxes are put 
back into the queue (Line 8 ). In this work, we assume that 
Branch selects variables in an order that makes the search 
behave in a breadth-first manner. 

Realpaver [12| has been developed as a (sequential) imple¬ 
mentation of a NCSP solver. 

Example 2. A set of boxes enclosing the solution set of the 
NCSP from Example[^ which is computed with e = 0.01, is 
shown in Figure 

There are some characteristics that make the parallel solv¬ 
ing of NCSPs different from that of other CSPs. First, com¬ 
puting Prune is expensive and causes a bottleneck during 


Input: NCSP (v,vo,c), precision e 
Output: set of boxes S 
1: L:={uo};S':=0 
2: while L 7^ 0 do 
3: V := Prunec(PollFirst(L)) 

4: if u 7 ^ 0 then 

5: if wid u < e V u is an inner box then 

6 : S':=S'U{u} 

7: else 

8 : L AddLast(L, Branch(u)) 

9: end if 

10: end if 

11: end while 
12: return S 

Figure 1: Branch and prune algorithm 



Figure 2: Paving of a solution set 

the solving process; in our experiments, a Prune call takes 
around 1ms in average. Second, Prune applications result 
in unbalanced search trees. Under certain conditions. Prune 
contracts a large portion of a considered box (cf. quadratic 
convergence of the interval Newton methods) and may even 
filter out the entire box if the box in question is verified as 
an inner or totally inconsistent region. Thus, it is crucial for 
efficient NCSP solving to execute Prune at each step while 
traversing the search tree, which, on the other hand, makes 
it more difhcult to distribute a search path among proces¬ 
sors. Third, the number of solutions is large when a problem 
is under-constrained and a small e is given; this causes the 
search tree to spread toward the bottom. Last, the search 
processes for different branches do not require communica¬ 
tion; thus, it is safe to run processes on different cores in 
parallel, without any modihcation. 

4. PARALLELIZATION USING 
GLOBAL LOAD BALANCING 

We parallelize the branch and prune algorithm by split¬ 
ting and distributing a search space (represented as queue L 
content) among the workers that run the branch and prune 
processes homogeneously. A balanced distribution is not 
straightforward; a naive method is to create a frontier of 
sufficient number of nodes in the search tree, and distribute 
them evenly across the workers; however, a breadth-first 
search computation of such a frontier is not efficient because 
of the time-consuming Prune process. 

The proposed method is implemented simply with XIO 
and the global load balancing (GLB), which is an efficient 
scheme for load balancing of irregular tasks. 





































Input: environment E, TaskQueue instance Q 
Output: task result 
Parameter: i € R>o, w, Z, a € N 
1 : Z/i lnitLifeline_B(/, z) 

2: repeat 

3: while Q.process{i) do {active phase} 

4 : DistributeToThieves£;(Q) 

5: end while 

{idle phase 1 } 

6: for {j := 1; j < w A Q.empty, }++) do 

7 : TryStealFrom^(Randomld()) 

8: end for 

{idle phase 2 } 

9: for (} := 1; j < LL.length A Q.empty, }++) do 

10: if -^LL{j).activated then 

11: LL{j).activated := true 

12: TryStealFromg(Z/i(j)) 

13: end if 

14: end for 

15: until Q. empty 

16: return Q.getResultO 

Figure 3: Worker process 


4.1 XIO GLB Library 

GLB is a global load balancing library in the XIO 
standard library that implements the lifeline graph work¬ 
stealing algorithm [20| . GLB is suitable for parallelizing 
irregular tasks, where the workload for each subtask is not 
predictable, such as search algorithms for AI applications. 

GLB computation is performed by multiple cooperative 
workers. Each worker runs on an XIO place and homoge¬ 
neously processes a divided workload. The load balanciug 
between workers is done in two phases: first, work stealing 
via requests sent from one worker to randomly selected other 
workers; then, work stealing and termination detection via 
a hyper-cube network of workers called a lifeline. GLB is 
simply implemented with XIO with configurable parameters 
and scales up to 16K places when applied to several bench¬ 
marks. For each GLB application, a sequential computa¬ 
tion that processes a workload is implemented as an XIO 
class TaskQueue and an instance is given to a worker as in¬ 
put. There are four parameters: i € R>o specifies the lower 
bound on the time (in seconds) taken by a unit of sequential 
proces^ w € N specifies the number of attempted workload 
steals in the first phase; and I G N and z G N specify the 
diameter of the lifeline graph and the number of branches of 
each node, respectively, w = 0 turns off the random steal¬ 
ing process. We assume E is greater than or equals to the 
number of workers. A tight lifeline graph is built by setting 
I — 2', broadcasting is done in two hops. 

A pseudo algorithm in Figure mimics the worker pro¬ 
cess. When the TaskQueue instance Q contains a workload, 
a worker becomes active and iterates the loop at Lines 3- 
5. A call to the process(i) method of Q invokes a unit se¬ 
quential computation that should take at least i seconds. 
DistributeToThieves sends portions of the available workload 
(i.e. Q.split))) to other idling workers; otherwise, it signals 
that the worker has no workload available. When all work¬ 
load is processed and property Q.empty becomes true, the 

^We modified GLB to use the parameter i instead of n, 
which specified the number of processed unit tasks. 


worker enters the two-stage idle phase: at Lines 6 - 8 , the 
worker randomly selects another worker, sends a request, 
and waits for the victim’s DistributeToThieves to respond; 
at Lines 9-14, the worker sends a request following the life¬ 
line graph. When the steal succeeds, the loot L is merged 
by executing ( 5 .merge(L). When Q is still empty, the worker 
process terminates and outputs the task result. 

4.2 Implementation of NCSP Solver with GLB 

We implement TaskQueue to encapsulate computation of 
the branch and prune algorithm. TaskQueue holds the queue 
L of undecided boxes and the solution set S. Initially, a 
worker possesses the initial domain no in L and the queues 
of other workers are left empty. Methods of TaskQueue are 
implemented as follows: 

• process(i) computes the main loop of the branch and 
prune algorithm until the time i elapses. For Prune, 
the C-F-F implementation of Realpaver is used. 

• split)) divides L equally into two and returns a portion. 

• merge)!/) is given a set L of boxes and appends the 
boxes to L. 

• getResult)) returns |5|. Our implementation does not 
gather S in one place, thus avoiding unnecessary over¬ 
head. Indeed, S might be better distributed in the 
post-process of many applications. 

The implementation consists of about 1,000 lines of XIO 
and 2,400 lines of G-F-F code. In NCSP applications, the 
solving process must be tweaked by trying several combina¬ 
tions of Prune implementations and search strategies. In this 
respect, our simple XIO framework that is interfaced with 
C-F-F solver implementations will serve as a practical tool. 


5. PERFORMANCE EVALUATION 

We evaluated our parallel NCSP solver with several prob¬ 
lems from existing literature. The experiments were oper¬ 
ated on the TSUBAME2.5 supercomputer, which is a super¬ 
computer at Tokyo Institute of Technology]^ Each node of 
TSUBAME2.5 has two Intel Westmere EP 2.93GHz proces¬ 
sors (12 cores in total) and 54GB of local memory. We used 
50 nodes; thus, each experiment was run with up to 600 XIO 
places on 600 cores. We used native XIO version 2.4.3.2 and 
its MPI backend (based on Open MPI 1.6.5). 


5.1 Experimental Results 

We solved four instances of two well-constrained (WC) 
problems taken from 13 10 and six instances of two under¬ 


constrained (UC) problems taken from For each of 

the first three problems, we prepared two instances by vary¬ 
ing the number of variables and constraints. For the UC 
problems, we solved with two different precisions. Every in¬ 
stance was solved with the following seven GLB parameter 
configurations: 


(1) i = 0.001s, 1 = 2, and w = 0. 

(2) i = 0.001s, 1 = 2, and w = 1. 

(3) i = 0.001s, 1 = 2, and w = z. 

(4) i = 0.001s, I = P, and w = 0. 

(5) i = 0.001s, I = P, and w = z. 

(6) i = 0.1s, I = 2, and w = 0. 

(7) i = 0.1s, I = 2, and w = z. 

^http://tsubame.gsic.titech.ac.jp/en 











Table 1: Considered problems and experimental results 


problem 

size 

e 

# sol 

#br 

w 

t\ 

tsoo 

teoo 

areoo 

# sbeoo 

Economics 

8 

10 '^ 

8 

63 478 

0 

58s 

0.40s 

0.41s 

64% 

47 000 

{eco8) 





1 


0.78s 

0.77s 

25% 

27 500 

Economics 

10 

10 ““ 

16 

3 614 945 

0 

5 970s 

22 .0s 

11 .8s 

88 % 

2 550 000 

{ecolO) 





1 


21.4s 

11.5s 

93% 

1 150 000 

Periodic orbits 

48 

10 ““ 

2 939 

28 742 

0 

1 330s 

8.5s 

5.0s 

58% 

34 800 

{henon24) 





1 


6 .8s 

4.4s 

63% 

19 000 

Periodic orbits 

56 

10 ““ 

16 105 

174 446 

0 

12 530s 

60.2s 

31.2s 

65% 

201 000 

{henon28) 





1 


45.7s 

25.1s 

87% 

81 000 

2D sphere 

2+2 

0.004 

312 064 

364 961 

0 

122 s 

0 .6s 

0.5s 

75% 

295 000 

and plane 





1 


0 .8s 

0.9s 

39% 

153 000 

{sp2-2) 


0.001 

2 490 988 

2 936 705 

0 

780s 

3.8s 

2 .2s 

87% 

2 300 000 






1 


3.9s 

2 .6s 

78% 

955 000 

4D sphere 

2+4 

0.004 

1 459 225 

2 488 689 

0 

1 202s 

7.0s 

3.8s 

85% 

1 790 000 

and plane 





1 


6 .6s 

4.1s 

85% 

662 000 

{sp2-4) 


0.001 

11 759 158 

20 082 197 

0 

12 800s 

52s 

27s 

92% 

12 850 000 






1 


50s 

26s 

97% 

4 600 000 

3-RPR robot 

3+3 

0.2 

1 488 388 

1 936 939 

0 

598s 

2 .8s 

1.7s 

80% 

1 550 000 

{3rpr) 





1 


2.9s 

2 .1s 

71% 

675 000 



0.1 

5 649 780 

7 186 845 

0 

2 135s 

9.0s 

5.0s 

86 % 

5 600 000 






1 


8.7s 

5.2s 

88 % 

2 300 000 


z was set as [log; P] (such that P > P)- 

Specification of the instances and experimental results us¬ 
ing configurations (1) and (2), which were performed most 
efficiently, are shown in Table The columns “problem”, 
“size”, “e”, sol”, br”, and “ui” represent the name of 
the problem, size (i.e., the number of variables n; for UC 
problems, the number of equality constraints n/ that are 
separated with precision, number of solutions, number 
of branches, and value of w, respectively. The rest of the 
columns provide some experimental results, tj represents 
the running time using j XIO places (best timings are under¬ 
lined). areoo represents the mean of the ratio of active time 
versus the total solving time at each place when computed 
with 600 places. ^ sbgQQ represents the total number of sent 
boxes from 600 places for load balancing. Figurej^shows the 
number of paths per depth in each search tree of the four 
instances. Figure illustrates the speedups of the parallel 
solving processes for the seven instances. Figure [^illustrates 
the fraction of the three worker states within the CPU tim¬ 
ing for the two instances. Each layer, from bottom to top, 
corresponds to the time taken for Prune, DistributeToThieves, 
and the idle phase (Lines 6-14 in Figure j^, respectivelyj^ 

tlfi^S^ii^SiftJJts, our parallel solver scaled up to 600 
places/cores and achieved up to 516-fold speedup (an effi¬ 
ciency of 0.84). 

As can be seen from Figure [^ each of the search trees 
for the instances considered has a number of paths whose 
lengths are close to the height of the tree; thus, a certain level 
of parallelism exists. Furthermore, comparing the graphs 
of the different instances of the same problem, we see that 
the shapes of graphs are similar, but the size of the tree 
of larger instances increases exponentially, indicating that 
parallelization should be easier for larger instances. 


®The timings for Prune differed per number of places. As a 
reason, we predicted and confirmed that the CPU cache hit 
ratio differed in the parallel processes. 
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(b) sp2-2, e = 0.004 (left); sp2-4, e = 0.001 (right). 


Figure 4: Number of search paths (vertical axis) 
per depth (horizontal axis) 


For most instances (excluding the large ones), the best 
speedups were accomplished with configuration (1): the par¬ 
allel process with the most frequent load balancing that used 
the lifeline graph with the broadest distribution and did not 
perform random stealing. The efficiency of load balancing 
is evident in the active ratio of workers (see areoo in Table 
and the left-hand figures of Figure]^. Despite its large com¬ 
munication overhead (see the last column of Table , load 
balancing that used the lifeline resulted in quick workload 
distribution and termination. 

For large instances, such as ecolO (Figure |5(a)[ right), 
henon28, and sp4 (e = 0.001; Figure 5(b)| right), configura- 
tions (2), (3), and (5), which had frequent random stealing, 
outperformed configuration (1). Their performance also ap¬ 
peared in the active ratio (Figure [6(a)| right). Among these 
configurations, configuration (2) performed the best proba¬ 
bly because of its quick termination process. 

Regarding time interval i of the load balancing, the most 
frequent setting, i = 0.001s, performed well. With this set- 
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Figure 5: Speedup of the solving process 
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Figure 6: Breakdown of CPU time with timings for pruning, load sending, and idling 
















































ting, workload was distributed between almost every call 
to Prune which takes around 1ms on average and, in total, 
occupies around 90% of the running time. 

Configuration (4) always performed poorly. Its load bal¬ 
ancing, which used a lifeline formed as a ID hyper cube 
(i.e. ring), was slow and led to its poor performance. Its 
performance improved dramatically by enabling the random 
stealing process (configuration (5)). 

In some experiments, such as sp4 (e = 0.004; Figure [5(b)[ 
middle), Srpr (e = 0.1; Figure 5(c)[ right), and 3rrr (e = 
0 . 1 ), configuration ( 1 ) scaled well and outperformed other 
configurations when using 600 cores. It would appear that 
the random stealing process, when using many cores, suf¬ 
fered from large communication overhead; such large over¬ 
head can be confirmed with configuration ( 2 ), shown in the 
right-hand side of Figure [ 6 (b)[ 

Finally, the running times of some experiments were quite 
short. For example, eco8 and sp2-2 (e = 0.004) took 58s and 
122 s, respectively, with a single core, and certain speedups 
were achieved: 141- and 252-fold with 600 cores. 


6. CONCLUSIONS 

In this work, we show that the parallelization of the branch 
and prune search is a good application of the XIO GLB 
framework. In the experiments, we achieved nearly linear 
speedups up to 600 XIO places/cores and are expected to 
be able to scale further. In future work, we plan further ex¬ 
periments on realistic problems including optimization prob¬ 
lems, using a greater number of cores. 
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